Discriminative Identification of Duplicates

Authors

  • Peter Haider
  • Ulf Brefeld
  • Tobias Scheffer
Abstract

The problem of finding duplicates in data is ubiquitous in data mining. We cast the problem of finding duplicates in sequential data into a poly-cut problem on a fully connected graph. The edge weights can be identified with parameterized pairwise similarities between objects, which are optimized by structural support vector machines on labeled training sets. Our approach adapts the similarity measure to the data and is independent of the number of clusters. We present three large-margin approximations for learning the pairwise similarities: an integrated QP formulation, a sequential multi-class approach, and a pairwise classifier. We report on experimental results.
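
As a rough illustration of the idea in the abstract (edge weights given by a parameterized pairwise similarity, with the number of clusters left open), the following sketch may help. It is not the authors' method: the feature map `pair_features`, the weight vector `w`, and the greedy merging heuristic are all assumptions chosen for brevity, and the weights are set by hand rather than learned by a structural support vector machine.

```python
from itertools import combinations

def pair_features(a, b):
    """Hypothetical pairwise feature map: token Jaccard overlap plus a bias term."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    jaccard = len(ta & tb) / len(union) if union else 1.0
    return [jaccard, 1.0]

def similarity(w, a, b):
    """Parameterized similarity: inner product of weights and pair features."""
    return sum(wi * fi for wi, fi in zip(w, pair_features(a, b)))

def greedy_clusters(items, w):
    """Greedy heuristic (an assumption, not the paper's decoder):
    repeatedly merge the two clusters whose summed cross-cluster
    similarity is largest, and stop when no merge has positive gain.
    The number of clusters is therefore never fixed in advance."""
    clusters = [[i] for i in range(len(items))]
    while True:
        best_gain, best_pair = 0.0, None
        for p, q in combinations(range(len(clusters)), 2):
            gain = sum(similarity(w, items[i], items[j])
                       for i in clusters[p] for j in clusters[q])
            if gain > best_gain:
                best_gain, best_pair = gain, (p, q)
        if best_pair is None:
            return clusters
        p, q = best_pair  # p < q, so deleting q keeps index p valid
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]

# Toy data and hand-picked weights (both hypothetical).
messages = [
    "cheap pills buy now",
    "buy cheap pills now",
    "meeting agenda for monday",
    "agenda for monday meeting",
    "hello",
]
w = [1.0, -0.5]  # similarity is positive only when Jaccard overlap exceeds 0.5
clusters = greedy_clusters(messages, w)
print(clusters)
```

In the setting the abstract describes, `w` would be fitted on labeled training sets so that true duplicates receive positive edge weights and non-duplicates negative ones; here the negative bias weight plays that role by hand.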

Similar resources

Language Identification and Multilingual Speech Recognition Using Discriminatively Trained Acoustic Models

We perform language identification experiments for four prominent South-African languages using a multilingual speech recognition system. Specifically, we show how successfully Afrikaans, English, Xhosa and Zulu may be identified using a single set of HMMs and a single recognition pass. We further demonstrate the effect of language identification-specific discriminative acoustic model training ...

Identification and Analysis of Genes and Pseudogenes within Duplicated Regions in the Human and Mouse Genomes

The identification and classification of genes and pseudogenes in duplicated regions still constitutes a challenge for standard automated genome annotation procedures. Using an integrated homology and orthology analysis independent of current gene annotation, we have identified 9,484 and 9,017 gene duplicates in human and mouse, respectively. On the basis of the integrity of their coding region...

Previously unidentified duplicate registrations of clinical trials: an exploratory analysis of registry data worldwide

BACKGROUND: Trial registries were established to combat publication bias by creating a comprehensive and unambiguous record of initiated clinical trials. However, the proliferation of registries and registration policies means that a single trial may be registered multiple times (i.e., "duplicates"). Because unidentified duplicates threaten our ability to identify trials unambiguously, we invest...

Discriminative Training of GMM for Language Identificatio..

In this paper, a discriminative training procedure for a Gaussian Mixture Model (GMM) language identification system is described. The proposal is based on the Generalized Probabilistic Descent (GPD) algorithm and Minimum Classification Error Rates formulated to estimate the GMM parameters. The evaluation is conducted using the OGI multi-language telephone speech corpus. The experimental result...

In-set/out-of-set speaker identification based on discriminative speech frame selection

In this paper, we propose a novel discriminative speech frame selection (DSFS) scheme for the problem of in-set/out-of-set speaker identification, which seeks to decrease the similarity between speaker models and background model (or antispeaker model), and increase the accuracy of speaker identification. The working scheme of DSFS consists of two steps: speech frame analysis and discriminative...

Publication date: 2006